TEG: GPU Performance Estimation Using a Timing Model
Authors
Abstract
Modern Graphics Processing Units (GPUs) offer significant speedups over conventional processors, and general-purpose programming on GPUs has become an important research area. The CUDA programming model provides a C-like interface and is widely accepted. However, because hardware vendors do not disclose enough details of the underlying architecture, programmers have to optimize their applications without fully understanding their performance characteristics. In this paper we present a GPU timing model that provides more insight into application performance on GPUs. A timing estimation tool for GPU CUDA programs (TEG) is developed based on this model. In particular, TEG illustrates how performance scales from one warp (a CUDA thread group) to multiple concurrent warps on an SM (Streaming Multiprocessor). Because TEG takes native GPU assembly code as input, it can estimate execution time with only a small error. TEG can help programmers better understand performance results and quantify the performance effects of bottlenecks.
Keywords: GPGPU, CUDA, Performance Estimation, Analytical Model
hal-00641726, version 1, 16 Nov 2011
Summary: In this report, we propose a model of the microarchitecture of a GPU in order to offer a better understanding of an application's performance on the GPU. TEG is a tool for estimating program execution time based on this model.
Similar resources
Implementation of the direction of arrival estimation algorithms by means of GPU-parallel processing in the Kuda environment (Research Article)
Direction-of-arrival (DOA) estimation of audio signals is critical in different areas, including electronic war, sonar, etc. The beamforming methods like Minimum Variance Distortionless Response (MVDR), Delay-and-Sum (DAS), and subspace-based Multiple Signal Classification (MUSIC) are the most known DOA estimation techniques. The mentioned methods have high computational complexity. Hence using...
Parallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform
There are different variants of Particle Swarm Optimization (PSO) algorithm such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating the convergence speed. However, these algorithms are computationally intensive. The go...
The performance of a combined solar photovoltaic (PV) and thermoelectric generator (TEG) system
The performance of a combined solar photovoltaic (PV) and thermoelectric generator (TEG) system is examined using an analytical model for four different types of commercial PVs and a commercial bismuth telluride TEG. The TEG is applied directly on the back of the PV, so that the two devices have the same temperature. The PVs considered are crystalline Si (c-Si), amorphous Si (a-Si), copper indi...
Accelerating high-order WENO schemes using two heterogeneous GPUs
A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes and the viscous terms are discretized by the standard fourth-order central scheme. The code written in CUDA programming language is developed by modifying a single-GPU code. The OpenMP library is used for parall...
Multiprocessing GPU Acceleration of H.264/AVC Motion Estimation under CUDA Architecture
Abstract— This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of Full Search block matching algorithm in the CUDA architecture, and to compare the performance with a theoretical model and two implementations (sequential and parallel using OpenMP library). We obtained a O(n2/log2n) speed-up wh...
Publication date: 2011